This is the final project for CMU 95-778 R for Data Science. The Kaggle competition we chose is Costa Rican Household Poverty Level Prediction, hosted by the Inter-American Development Bank and Kaggle. The project GitLab repository is available here.

Business Problem Overview:

The Inter-American Development Bank is working on a social program to aid impoverished individuals in Costa Rica. Impoverished individuals are usually unable to produce direct financial documents (income or expense records) to prove that they truly need the aid. The current model, the Proxy Means Test (PMT), which scores observable household attributes (e.g., how many rooms the house has, how many people live in the household), still needs improvement in prediction accuracy. Our team is working on these challenges to help the Inter-American Development Bank identify the most impoverished individuals to aid.

Business Question to Address:

What are the most important variables for identifying household poverty level? Which model should we recommend to improve the accuracy of household poverty identification, so that the Inter-American Development Bank can target its aid programs at the people most in need?

Let’s start!

library(tidyverse)
library(caret)
library(ggplot2)
library(randomForest)
library(scales)
library(reshape2)
library(e1071)
library(DataExplorer)
library(corrplot)
library(gridExtra)
library(nnet)
library(factoextra)
library(gbm)

poverty=read_csv("Train.csv")
poverty_test=read_csv("test.csv")

Explore and Clean Data

introduce(poverty)
introduce(poverty_test)

The datasets contain 142 features and 1 label (training set only). They give a comprehensive description of each surveyed household, covering several important aspects: household finance, household items, house condition, family composition, education, and basic demographics.

Below is a list of some representative features:

Household Finance:

v2a1, Monthly rent payment


Household Items:

v18q1, number of tablets household owns

refrig, =1 if the household has refrigerator


House Condition:

rooms, number of all rooms in the house

paredblolad, =1 if predominant material on the outside wall is block or brick


Family Composition:

hhsize, household size

tamviv, number of persons living in the household

hogar_adul, Number of adults in household


Education:

instlevel1, =1 no level of education

table(poverty$Target)
## 
##    1    2    3    4 
##  755 1597 1209 5996
ggplot(poverty)+geom_bar(aes(Target), fill = "#FF6666")+ggtitle("Distribution of Label")+theme(plot.title = element_text(hjust = 0.5))+scale_y_continuous(labels = comma)

The label “Target” has 4 levels, coded as: “1”, “2”, “3” and “4”, where “1” stands for extreme poverty and “4” stands for non vulnerable households.

The training set has 9557 observations and the testing set has 23856 observations. To obtain higher accuracy on the testing set, our model should, on the one hand, include as much information as possible and, on the other hand, be kept from overfitting.
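As a quick check on the class imbalance, the class shares can be computed directly from the counts in the table above (base R):

```r
# Class counts from table(poverty$Target) above; class 4 (non vulnerable)
# dominates with roughly 63% of the training rows.
counts <- c(`1` = 755, `2` = 1597, `3` = 1209, `4` = 5996)
shares <- prop.table(counts)
round(shares, 4)
```

This imbalance is also why the "No Information Rate" reported by caret later sits near 0.63: always predicting class 4 already gets about 63% accuracy.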

Missing Plot

poverty_missing=poverty %>%  select(everything()) %>% summarise_all(funs(sum(is.na(.)))) %>% gather(variables,missing_amount) %>% mutate(percent=missing_amount/nrow(poverty)) %>% 
  arrange(percent) %>% mutate(levels=case_when(percent>=0.8~"Remove",
                                              percent>=0.4 & percent<0.8~"Bad",
                                              percent>=0.05 & percent<0.4~"OK",
                                              percent<0.05~"Good")) %>% filter(missing_amount>0)

ggplot(poverty_missing, aes(x = reorder(variables, -percent), y = missing_amount,fill=levels)) + geom_bar(stat = "identity") + coord_flip()+ scale_fill_manual(values=c("#1a9641", "#a6d96a", "#d7191c"))+ xlab("Features") + ylab("Number of missing rows") +   ggtitle("Missing variable plot")+ 
  geom_text(aes(label = paste0(round(100 * percent, 2), "%"))) + scale_y_continuous(labels = comma)+theme(plot.title = element_text(hjust = 0.5))

From the missing plot, we can see that 5 variables contain NAs and 3 of them have a large proportion of NAs; therefore, at a later stage, we drop these three variables and impute the NAs in the remaining two.
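A minimal base-R sketch of this strategy on toy data (column names hypothetical): drop columns above a 40% missing threshold, then impute the remaining NAs with 0, mirroring the impute.zero() helper used later in the Random Forest section.

```r
# Toy data: `a` is 80% missing (drop), `b` is 20% missing (impute with 0).
df <- data.frame(
  a = c(1, NA, NA, NA, NA),
  b = c(1, 2, NA, 4, 5),
  c = 1:5
)
na_rate <- colMeans(is.na(df))           # fraction of NAs per column
df <- df[, na_rate <= 0.4, drop = FALSE] # drop the "Remove"/"Bad" columns
df[] <- lapply(df, function(x) replace(x, is.na(x), 0))
```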

By the way, we find that the data has already been dummified (one-hot encoded), for example:

lugar1, =1 region Central

lugar2, =1 region Chorotega

lugar3, =1 region Pacífico central

lugar4, =1 region Brunca

lugar5, =1 region Huetar Atlántica

lugar6, =1 region Huetar Norte

Therefore, to reduce the training time of certain models (for example, random forest), we can reconstruct these one-hot features to reduce the dimensionality of the feature space.
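The idea can be sketched on a toy frame (region dummies lugar1–lugar3 only): the max.col() trick used in the Data Cleaning section below collapses each block of one-hot columns into a single integer column.

```r
# Toy one-hot region dummies collapsed back into one integer feature.
toy <- data.frame(
  lugar1 = c(1, 0, 0),
  lugar2 = c(0, 1, 0),
  lugar3 = c(0, 0, 1)
)
dummies <- toy[, startsWith(names(toy), "lugar")]
toy$lugar_int <- max.col(dummies)  # index of the column containing the 1
toy <- toy[, setdiff(names(toy), names(dummies)), drop = FALSE]
```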

Features Distribution Analysis

We want to see how variables describing different aspects of a household are distributed. Some are roughly normally distributed, some are left-skewed, and some show no particular pattern. We use different colors to differentiate the poverty levels (1 being the extremely poor, 4 being the non-vulnerable).

The variable representing the number of rooms in the house is roughly normally distributed.
The variables representing household size and number of mobile phones are left-skewed.
The variable representing years of schooling shows no particular pattern.

Within each distribution, the non-vulnerable group takes up the largest share. This is because our dataset naturally contains more non-vulnerable households.

Later, in the Principal Component Analysis section, we will choose the 10 principal components that explain the most variance. Thus, we do not transform the data further for the multinomial logistic model in this project.

a<-ggplot(poverty,aes(x=rooms))+geom_bar(aes(fill=as.factor(Target)))+ylab("Count")+xlab("Number of All Rooms in the House")+labs(fill="Poverty")

b<-ggplot(poverty,aes(x=tamhog))+geom_bar(aes(fill=as.factor(Target)))+ylab("Count")+xlab("Size of the Household")+labs(fill="Poverty")

c<-ggplot(poverty,aes(x=escolari))+geom_bar(aes(fill=as.factor(Target)))+ylab("Count")+xlab("Years of Schooling")+labs(fill="Poverty")

d<-ggplot(poverty,aes(x=qmobilephone))+geom_bar(aes(fill=as.factor(Target)))+ylab("Count")+xlab("Number of Mobile Phones")+labs(fill="Poverty")

grid.arrange(a,b,c,d, nrow = 2,top="Variables Distribution Analysis")

Let’s see which of the 142 variables relate most strongly to the degree of poverty (the variable “Target”; the larger the number, the lower the degree of poverty). As a first step, we pick some variables to draw boxplots, and we notice that years of schooling (the variable escolari) and the degree of crowding in the house (the variable overcrowding) show clear trends with the degree of poverty (their distributions shift as the poverty level changes). Thus, they deserve close attention when building the models.

p1 <- ggplot(poverty, aes(x = as.factor(Target), y = rooms)) +
  geom_boxplot(aes(fill=as.factor(Target))) +
  labs(fill="degree_of_poverty")+
  coord_flip()+
  labs( x="degree of poverty")+
  ggtitle("Frequency distribution - number of rooms in house") +
  theme(plot.title = element_text(hjust = 0.5, size = 10))

p2 <- ggplot(poverty, aes(x = as.factor(Target), y = hogar_total)) +
  geom_boxplot(aes(fill=as.factor(Target))) +
  labs(fill="degree_of_poverty")+
  coord_flip()+
  labs( x="degree of poverty")+
  ggtitle("Frequency distribution - number of total individual in household") +
  theme(plot.title = element_text(hjust = 0.5, size = 10))

grid.arrange(p1,p2,nrow=2)

p3 <- ggplot(poverty, aes(x = as.factor(Target), y = SQBdependency)) +
  geom_boxplot(aes(fill=as.factor(Target))) +
  labs(fill="degree_of_poverty")+
  coord_flip()+
  labs( x="degree of poverty")+
  ggtitle("Frequency distribution - dependency squared") +
  theme(plot.title = element_text(hjust = 0.5))

p4 <- ggplot(poverty, aes(x = as.factor(Target), y = escolari)) +
  geom_boxplot(aes(fill=as.factor(Target))) +
  labs(fill="degree_of_poverty")+
  coord_flip()+
  labs( x="degree of poverty")+
  ggtitle("Frequency distribution - year of schooling") +
  theme(plot.title = element_text(hjust = 0.5))

p5 <- ggplot(poverty, aes(x = as.factor(Target), y = overcrowding)) +
  geom_boxplot(aes(fill=as.factor(Target))) +
  labs(fill="degree_of_poverty")+
  coord_flip()+
  labs( x="degree of poverty")+
  ggtitle("Frequency distribution - degree of overcrowding") +
  theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p3,p4,p5,nrow=3)

Then we use scatterplots to analyze relationships between variables. After trying many pairs, we found that “tamhog” (size of the household) and “tamviv” (number of persons living in the household), as well as “r4t2” (persons 12 years of age and older) and “hogar_adul” (number of adults in the household), have strong positive relationships, as the following scatterplots show. Thus, when building the models, we should account for relationships between variables such as these.

p6 <- ggplot(poverty, aes(x = tamhog, y = tamviv,color=as.factor(Target))) +
  geom_point(alpha = 0.3) +
  labs(color="degree_of_poverty")+
  coord_flip()+
  ggtitle("Scatterplot -relationship_tamhog_tamviv") +
  theme(plot.title = element_text(hjust = 0.5))

p7 <- ggplot(poverty, aes(x = hogar_adul, y = r4t2,color=as.factor(Target))) +
  geom_point(alpha = 0.3) +
  labs(color="degree_of_poverty")+
  coord_flip()+
  ggtitle("Scatterplot -relationship_r4t2_hogar_adul") +
  theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p6,p7,nrow=2)

Data Cleaning

feature_list = c(
  "pared",
  "piso",
  "techo",
  "abasta",
  "sanitario",
  "energcocinar",
  "elimbasu",
  "epared",
  "etecho",
  "eviv",
  "estadocivil",
  "parentesco",
  "instlevel",
  "tipovivi",
  "lugar",
  "area"
)


new_features_integer = data.frame(matrix(ncol = length(feature_list), nrow = nrow(poverty)))

ohe_names = vector()

for(i in 1:length(feature_list)){
  
  feature_to_fix = poverty %>% select(starts_with(feature_list[i]))

  
  new_features_integer[,i] = as.integer(factor(names(feature_to_fix)[max.col(feature_to_fix)], ordered = FALSE))
  names(new_features_integer)[i] = paste0(feature_list[i],"_int")
  
  ohe_names = c(ohe_names, as.vector(names(feature_to_fix)))
  
}
for(i in 1:length(feature_list)){
  poverty =poverty %>% select(-starts_with(feature_list[i]))
}

poverty=as_tibble(cbind(poverty , new_features_integer))

new_features_integer = data.frame(matrix(ncol = length(feature_list), nrow = nrow(poverty_test)))

for(i in 1:length(feature_list)){
  

  feature_to_fix = poverty_test %>% select(starts_with(feature_list[i]))
  
 
  new_features_integer[,i] = as.integer(factor(names(feature_to_fix)[max.col(feature_to_fix)], ordered = FALSE))
  names(new_features_integer)[i] = paste0(feature_list[i],"_int")
  
  ohe_names = c(ohe_names, as.vector(names(feature_to_fix)))
  
}
for(i in 1:length(feature_list)){
  poverty_test =poverty_test %>% select(-starts_with(feature_list[i]))
}

poverty_test=as_tibble(cbind(poverty_test , new_features_integer))

Correlation

Below is the correlation plot of the numeric variables. For each pair of variables, the darker the color, the stronger the correlation: blue indicates positive correlation and red indicates negative correlation. Notice that some variables, such as age and SQBage, are highly correlated because SQBage is simply age squared. When building a model we should deal with this, for example by keeping only one of SQBage and age, since they carry essentially the same information.
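One simple way to act on this, sketched in base R on toy data standing in for the age/SQBage pair, is to flag pairs with |r| above a threshold and keep one variable from each pair:

```r
# Toy frame where SQBage is age squared, so the pair is almost collinear.
age <- 20:60
df <- data.frame(age = age, SQBage = age^2, noise = sin(1:41))
cm <- abs(cor(df))
high <- upper.tri(cm) & cm > 0.9        # flag each correlated pair once
drop_cols <- colnames(cm)[apply(high, 2, any)]
df_reduced <- df[, setdiff(names(df), drop_cols), drop = FALSE]
```

caret (already loaded above) offers findCorrelation() for the same job on a full correlation matrix.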

corrma=round(cor(poverty %>% select(-v2a1,-v18q1,-rez_esc,-dependency,-edjefa) %>% na.omit()%>% select_if(is.numeric)),2)

corrplot(corrma,order = "hclust", tl.cex = 0.5)

PCA

The first principal component (PC1) and the second principal component (PC2) together explain 16.8% of the variance in the dataset.

From the graph, we can interpret these two components:

PC1 is largely related to the number of people in the household and how crowded the home is.

Some contributing variables are:

hogar_total, # of total individuals in the household

r4t3, Total persons in the household

hhsize, household size

hogar_nin, Number of children 0 to 19 in household

overcrowding, # persons per room

PC2 is largely related to education:

Some contributing variables are:

escolari, years of schooling

SQBescolari, escolari squared

The first 10 principal components explain 36.9% of the variance in the dataset, which is substantial for our model.
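These percentages can be read off a prcomp fit directly; the sketch below uses the built-in mtcars data, since the computation rather than the dataset is the point.

```r
# Proportion of variance per PC and its cumulative sum, i.e. the numbers
# that fviz_eig() visualizes; demonstrated here on mtcars.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
round(cum_var[1:3], 3)
```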

poverty_pca <- read_csv(file = "Train.csv")
poverty_pca_test <-read_csv(file ="test.csv")

poverty_pca$tag <-"Train"
poverty_pca_test$tag <-"Test"

poverty_pca_test$Target <-1
poverty_pca_total <- rbind(poverty_pca,poverty_pca_test)
poverty_pca_total <- poverty_pca_total %>% mutate(Target=as.factor(Target))
poverty_pca_total$v2a1 <-NULL  #NA
poverty_pca_total$v18q1 <-NULL  #NA
poverty_pca_total$rez_esc <-NULL #NA   #33413
poverty_pca_total$meaneduc <-NULL
poverty_pca_total$SQBmeaned <-NULL

poverty_pca_total$dependency <- NULL
poverty_pca_total$idhogar <-NULL

full = as_tibble(rbind(poverty_pca, poverty_pca_test))

# Create a list of the features names that need to be reverse engineered

feature_list = c(
  "pared",
  "piso",
  "techo",
  "abasta",
  "sanitario",
  "energcocinar",
  "elimbasu",
  "epared",
  "etecho",
  "eviv",
  "estadocivil",
  "parentesco",
  "instlevel",
  "tipovivi",
  "lugar",
  "area"
)

new_features_integer = data.frame(matrix(ncol = length(feature_list), nrow = nrow(full)))

ohe_names = vector()

for(i in 1:length(feature_list)){
  
  feature_to_fix = full %>% select(starts_with(feature_list[i]))
  
  new_features_integer[,i] = as.integer(factor(names(feature_to_fix)[max.col(feature_to_fix)], ordered = FALSE))
  names(new_features_integer)[i] = paste0(feature_list[i],"_int")
  
  ohe_names = c(ohe_names, as.vector(names(feature_to_fix)))
  
}

poverty_pca_numerical = poverty_pca_total %>% select_if(is.numeric)  ## get the numerical columns 
poverty_pca_numerical$elimbasu5 <- NULL

full_pca = prcomp(poverty_pca_numerical, center = TRUE, scale. = TRUE)
fviz_eig(full_pca, addlabels=TRUE)

fviz_pca_var(full_pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             select.var = list(contrib = 20), # top 20 contributing
             repel = TRUE,     # Avoid text overlapping
             title="PCA-Plot")

poverty_pca_total= poverty_pca_total %>% 
  cbind(full_pca$x[,1:10])

Data Preparation

Read the data from files and prepare it for the modeling part. We split the training set 70/30 into a training set and a validation set, because Kaggle does not provide the actual labels for its test set.

#randomForest
impute.zero <- function(x) replace(x, is.na(x), 0)
poverty_rf=poverty %>% select(-v2a1,-v18q1,-rez_esc,-Id,-idhogar,-dependency,-edjefa) %>% mutate(SQBmeaned=impute.zero(SQBmeaned),
                                                                                         meaneduc=impute.zero(meaneduc))
poverty_rf$Target=as.factor(poverty_rf$Target)

#GBM
poverty_gbm <- read_csv("Train.csv") %>%
  mutate(Target = as.factor(Target)) %>% 
  select(-dependency,-Id,-idhogar,-v2a1,-v18q1,-rez_esc,-SQBmeaned,-meaneduc)  

poverty_test_gbm <- poverty_test %>%
  select(-dependency,-idhogar,-v2a1,-v18q1,-rez_esc,-SQBmeaned,-meaneduc) 

#Logistic
poverty_pca_train <- subset(poverty_pca_total, tag=="Train")
poverty_pca_test <- subset(poverty_pca_total, tag=="Test")
poverty_pca_test$Target <-NULL

#Data Split
seed <- 100
set.seed(seed)

inTraining <- createDataPartition(poverty_rf$Target, p=0.7, list=FALSE)
poverty_rf_training=poverty_rf[inTraining,]
poverty_rf_validation=poverty_rf[-inTraining,]

inTraining <- createDataPartition(poverty_gbm$Target, p=0.7, list=FALSE)
poverty_gbm_training=poverty_gbm[inTraining,]
poverty_gbm_validation=poverty_gbm[-inTraining,]

inTraining <-createDataPartition(poverty_pca_train$Target,p=0.7,list=FALSE)
poverty_pca_train_training = poverty_pca_train[inTraining,]
poverty_pca_train_validation = poverty_pca_train[-inTraining,]

Random Forest

Limitation

We split the training data into a training set (70%) and a validation set (30%), and replace the NAs with 0s. We first train the random forest model with tuneLength = 5 and choose the mtry with the highest accuracy, 75, for the subsequent training. However, due to limited computational resources, we were not able to conduct a more precise parameter search or cross-validation to reduce overfitting.

impute.zero <- function(x) replace(x, is.na(x), 0)
poverty_rf=poverty %>% select(-v2a1,-v18q1,-rez_esc,-Id,-idhogar,-dependency,-edjefa) %>% mutate(SQBmeaned=impute.zero(SQBmeaned),
                                                                                         meaneduc=impute.zero(meaneduc))
poverty_rf$Target=as.factor(poverty_rf$Target)

seed <- 100
set.seed(seed)

inTraining <- createDataPartition(poverty_rf$Target, p=0.7, list=FALSE)
poverty_rf_training=poverty_rf[inTraining,]
poverty_rf_validation=poverty_rf[-inTraining,]

#control <- trainControl(method="repeatedcv", number=10, repeats=3)
tTrace=trainControl(verboseIter = TRUE)

metric <- "Accuracy"

mtry <- 75
tunegrid <- expand.grid(.mtry=mtry)

#To save time, we are not going to run the model again

#rf_fit3 <- train(Target ~ ., data = poverty_rf_training, method = "rf", metric = "Accuracy", tuneGrid=tunegrid,trControl=tTrace)

#print(rf_fit3)
#save(rf_fit3,file="rf_fit3.rda")


load("rf_fit3.rda")
pred_validation <- predict(rf_fit3, newdata=poverty_rf_validation)
confusionMatrix(data=pred_validation, poverty_rf_validation$Target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1  178    3    0    1
##          2   18  416   19    6
##          3    3   11  303    3
##          4   27   49   40 1788
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9372          
##                  95% CI : (0.9277, 0.9458)
##     No Information Rate : 0.6276          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.883           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity           0.78761   0.8685   0.8370   0.9944
## Specificity           0.99848   0.9820   0.9932   0.8913
## Pos Pred Value        0.97802   0.9063   0.9469   0.9391
## Neg Pred Value        0.98211   0.9738   0.9768   0.9896
## Prevalence            0.07888   0.1672   0.1264   0.6276
## Detection Rate        0.06213   0.1452   0.1058   0.6241
## Detection Prevalence  0.06353   0.1602   0.1117   0.6646
## Balanced Accuracy     0.89305   0.9252   0.9151   0.9429
poverty_rf_test=poverty_test%>% select(-v2a1,-v18q1,-rez_esc,-idhogar,-dependency,-edjefa) %>% mutate(SQBmeaned=impute.zero(SQBmeaned),
                                                                                         meaneduc=impute.zero(meaneduc))
pred <- predict(rf_fit3, newdata=poverty_rf_test)
result=data.frame(poverty_rf_test$Id,pred)
colnames(result)=c("Id","Target")
write.csv(result,"rf_submission.csv",row.names=FALSE)

Random Forest Kaggle Scores

Here is my result

From the confusion matrix of the validation predictions, the overall accuracy is quite high, but the accuracy is not balanced across the 4 levels: it decreases for the poorer classes, although the balanced accuracy of 89.3% for Class 1 is still decent.

For the extremely poor class (Class 1), the model has a sensitivity of 78.8%, meaning that it correctly identifies 78.8% of the households that are actually in the extremely poor class.
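That 78.8% figure can be re-derived from the confusion matrix above: sensitivity for a class is TP / (TP + FN), where caret's confusionMatrix() puts the reference class in the columns.

```r
# Confusion matrix copied from the caret output above (columns = reference).
cm <- matrix(c(178,  18,   3,  27,    # reference class 1
                 3, 416,  11,  49,    # reference class 2
                 0,  19, 303,  40,    # reference class 3
                 1,   6,   3, 1788),  # reference class 4
             nrow = 4,
             dimnames = list(Prediction = c("1", "2", "3", "4"),
                             Reference  = c("1", "2", "3", "4")))
sens_class1 <- cm["1", "1"] / sum(cm[, "1"])   # 178 / 226
round(sens_class1, 5)                          # matches the reported 0.78761
```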

varImp(rf_fit3)
## rf variable importance
## 
##   only 20 most important variables shown (out of 85)
## 
##                  Overall
## meaneduc          100.00
## SQBmeaned          98.18
## SQBdependency      60.26
## qmobilephone       44.91
## pared_int          41.69
## lugar_int          39.51
## SQBhogar_nin       38.27
## hogar_nin          36.82
## rooms              35.74
## overcrowding       32.29
## SQBovercrowding    31.76
## eviv_int           31.49
## SQBedjefe          31.08
## tipovivi_int       28.70
## energcocinar_int   26.03
## r4h2               25.75
## r4t1               25.47
## piso_int           25.01
## epared_int         24.80
## r4t2               24.69
impvars <- varImp(rf_fit3)
plot(impvars, main = "Variable Importance for Random Forest")

From the variable importance plot, we can see that the top 2 variables are both about education (average years of education for adults), followed by the dependency ratio, the number of mobile phones, and the predominant material of the outside walls. In the boxplots below, we can see strong relationships between these features and the label.

p8 <- ggplot(poverty_rf, aes(x = as.factor(Target), y = meaneduc)) +
  geom_boxplot(aes(fill=as.factor(Target))) +
  labs(fill="degree_of_poverty")+
  coord_flip()+
  labs( x="degree of poverty")+
  ggtitle("Frequency distribution - years of education") +
  theme(plot.title = element_text(hjust = 0.5))

p9 <- ggplot(poverty_rf, aes(x = as.factor(Target), y = SQBdependency)) +
  geom_boxplot(aes(fill=as.factor(Target))) +
  labs(fill="degree_of_poverty")+
  coord_flip()+
  labs( x="degree of poverty")+
  ggtitle("Frequency distribution - dependency ratio squared") +
  theme(plot.title = element_text(hjust = 0.5))

p10 <- ggplot(poverty_rf, aes(x = as.factor(Target), y = qmobilephone)) +
  geom_boxplot(aes(fill=as.factor(Target))) +
  labs(fill="degree_of_poverty")+
  coord_flip()+
  labs( x="degree of poverty")+
  ggtitle("Frequency distribution - number of mobile phones") +
  theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p8,p9,p10,nrow=3)

GBM

Limitations

Since we have limited background knowledge of the mathematics behind the GBM model, we do not know much about how to tune its parameters. Thus, we did not tune the GBM model extensively, but simply did our best to run it.

Modelling Process

Some of the variables have many NAs: from the EDA we found that v2a1, v18q1, rez_esc, SQBmeaned, and meaneduc have so many NAs that we must delete those columns. The variables edjefa and idhogar also have some NAs, and the EDA suggests they are not very important to the degree of poverty. Finally, Id has nothing to do with the degree of poverty, so we delete all the variables mentioned above before fitting the GBM model.

We saved the model as an .rda file so it can be reused next time, so we comment out the model-fitting code below.

#poverty_gbm <- expand.grid(interaction.depth = 5,
#                        n.trees = 100,
#                        shrinkage = 0.1,
#                        n.minobsinnode = 10)

#final_Gbm <- train(Target ~., data = poverty_gbm_training, method = "gbm",
#                   tuneGrid = poverty_gbm,
#                   verbose = TRUE)
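The save-once / load-later pattern used for the fitted models in this project (shown here with a stand-in lm() fit, since refitting the GBM is slow) looks like this:

```r
# Fit once, save to .rda, then restore the object by name with load().
fit <- lm(mpg ~ wt, data = mtcars)   # stand-in for the expensive GBM fit
path <- tempfile(fileext = ".rda")
save(fit, file = path)
rm(fit)
load(path)                           # brings `fit` back into the session
exists("fit")
```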
poverty_gbm_test=read_csv("test.csv")%>%
  select(-dependency,-idhogar,-v2a1,-v18q1,-rez_esc,-SQBmeaned,-meaneduc) 
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   Id = col_character(),
##   v2a1 = col_double(),
##   idhogar = col_character(),
##   dependency = col_character(),
##   edjefe = col_character(),
##   edjefa = col_character(),
##   meaneduc = col_double(),
##   overcrowding = col_double(),
##   SQBovercrowding = col_double(),
##   SQBdependency = col_double(),
##   SQBmeaned = col_double()
## )
## See spec(...) for full column specifications.
load("2final_Gbm_.rda")

varImp(final_Gbm)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 153)
## 
##               Overall
## SQBdependency 100.000
## cielorazo      42.418
## SQBedjefe      39.653
## hogar_nin      36.656
## overcrowding   33.353
## r4t1           33.071
## qmobilephone   32.331
## escolari       31.348
## eviv3          29.223
## v18q           27.289
## pisomoscer     14.245
## rooms          12.843
## paredblolad    11.810
## r4t2           11.748
## hogar_adul      9.426
## r4m1            8.394
## epared3         7.857
## etecho3         7.157
## r4h2            6.471
## r4m2            5.726

GBM Kaggle Result

Here is my result

Variable Importance:

From the variable importance plot, we can see that the most important variables are SQBdependency, cielorazo, and SQBedjefe, followed by hogar_nin and overcrowding. These variables suggest that the dependency ratio, house quality, the number of children, and the degree of overcrowding strongly affect the degree of poverty. Note that when we did the EDA and drew the boxplots, we already suspected that overcrowding might be an important variable, and the GBM results confirm it.

Also, note from the boxplots below that SQBdependency and overcrowding have a negative relationship with Target. This makes sense in the real world: households with more dependents and more crowded living conditions tend to be poorer, while households with better-quality housing tend not to be poor.

impvars <- varImp(final_Gbm)
plot(impvars, main = "Variable Importance for GBM")

p11 <- ggplot(poverty_gbm_training, aes(x = as.factor(Target), y = overcrowding)) +
  geom_boxplot(aes(fill=as.factor(Target))) +
  labs(fill="degree_of_poverty")+
  coord_flip()+
  labs( x="degree of poverty")+
  ggtitle("Frequency distribution - number of persons per room") +
  theme(plot.title = element_text(hjust = 0.5))

p12 <- ggplot(poverty_gbm_training, aes(x = as.factor(Target), y = SQBdependency)) +
  geom_boxplot(aes(fill=as.factor(Target))) +
  labs(fill="degree_of_poverty")+
  coord_flip()+
  labs( x="degree of poverty")+
  ggtitle("Frequency distribution - dependency ratio squared") +
  theme(plot.title = element_text(hjust = 0.5))

p13 <- ggplot(poverty_gbm_training, aes(x = as.factor(Target), y = cielorazo)) +
  geom_boxplot(aes(fill=as.factor(Target))) +
  labs(fill="degree_of_poverty")+
  coord_flip()+
  labs( x="degree of poverty")+
  ggtitle("Frequency distribution - has ceiling (cielorazo)") +
  theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p11,p12,p13,nrow=3)

pred_Gbm <- predict(object = final_Gbm, newdata = poverty_gbm_validation, type = "raw")
confusionMatrix(pred_Gbm, poverty_gbm_validation$Target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1  123    7    6    5
##          2   42  275   52   66
##          3    2    6  112   10
##          4   86  182  175 1717
## 
## Overall Statistics
##                                           
##                Accuracy : 0.777           
##                  95% CI : (0.7613, 0.7922)
##     No Information Rate : 0.6274          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5473          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity           0.48617  0.58511  0.32464   0.9549
## Specificity           0.99311  0.93322  0.99286   0.5852
## Pos Pred Value        0.87234  0.63218  0.86154   0.7949
## Neg Pred Value        0.95229  0.91979  0.91484   0.8853
## Prevalence            0.08828  0.16399  0.12038   0.6274
## Detection Rate        0.04292  0.09595  0.03908   0.5991
## Detection Prevalence  0.04920  0.15178  0.04536   0.7537
## Balanced Accuracy     0.73964  0.75916  0.65875   0.7701
poverty_test_id<-poverty_test %>% 
  select(Id)

predTest_Gbm <- predict(object = final_Gbm, newdata = poverty_gbm_test, type = "raw")
result=data.frame(poverty_test_id,predTest_Gbm)
colnames(result)=c("Id","Target")
write.csv(result,"gbm_submission.csv",row.names=FALSE)

Confusion Matrix:

From the confusion matrix, we notice that the accuracy of the GBM model is 77.7%, which is not bad. As for sensitivity: since our goal is to predict the poor people in the country, and the lower the class number, the poorer the people, class “1” means the poorest people. Our sensitivity for “1” is 48.62%, meaning that of all the households actually in class 1, we predict 48.62% correctly, which is not very good. As for PPV, the PPV for “1” is 87.23%, meaning that of the households we predict as “1”, 87.23% are actually “1”.
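Both GBM figures check out against the matrix above: sensitivity is a column-wise rate (reference classes sit in the columns of caret's output) and PPV is a row-wise rate.

```r
# Class-1 row and column of the GBM confusion matrix above.
predicted_as_1 <- c(123, 7, 6, 5)    # row "1": everything predicted as class 1
actually_1     <- c(123, 42, 2, 86)  # column "1": households truly in class 1
ppv_class1  <- 123 / sum(predicted_as_1)   # ~0.8723
sens_class1 <- 123 / sum(actually_1)       # ~0.4862
```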

Multinomial Logistic Regression

In the dataset we are working on, the dependent variable has 4 levels instead of 2, so we use a multinomial logistic regression model.

Coefficient analysis

PC1 has a negative relationship with the poverty class: its coefficient in the multinomial logistic model is negative, suggesting that as PC1 increases, individuals are more likely to fall into a poorer class. Recall that in the EDA section we found that PC1 is related to the size and overcrowding of the household; this suggests that more crowded households are, on average, more likely to be poor.

The coefficient of PC2 is positive, suggesting that as PC2 increases, individuals are more likely to fall into the non-vulnerable class. Again, in the EDA section we found that PC2 is related to years of schooling; this suggests that households with more years of schooling are, on average, more likely to be better off.

Both coefficient interpretations make sense intuitively: poorer households tend to be more crowded and have more people per home, while richer households have more years of schooling because they can afford the substantial tuition and living costs.

Again when we look back to the EDA: we are able to find similar patterns.

poverty_pca_multinolog <- multinom(Target ~PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10, data=poverty_pca_train_training)
## # weights:  48 (33 variable)
## initial  value 9277.081865 
## iter  10 value 7057.336356
## iter  20 value 6886.348347
## iter  30 value 6224.239883
## iter  40 value 6005.678785
## final  value 6005.634205 
## converged
summary(poverty_pca_multinolog)
## Call:
## multinom(formula = Target ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + 
##     PC7 + PC8 + PC9 + PC10, data = poverty_pca_train_training)
## 
## Coefficients:
##   (Intercept)         PC1        PC2         PC3        PC4         PC5
## 2   0.9477359 -0.02402320 0.06487031 -0.07725264 0.08130806 -0.03560469
## 3   0.8714587 -0.08413938 0.18347897 -0.11019816 0.16987582 -0.04125267
## 4   2.4799135 -0.26425112 0.44532696 -0.03214798 0.17913052 -0.10270996
##           PC6         PC7         PC8         PC9        PC10
## 2 -0.04310654 -0.07057798  0.01898095 -0.04253850 -0.03061731
## 3 -0.07378743 -0.01822965 -0.08580160 -0.06949967 -0.04727556
## 4 -0.20364789  0.02925247 -0.10679269  0.05668073  0.01535440
## 
## Std. Errors:
##   (Intercept)        PC1        PC2        PC3        PC4        PC5
## 2  0.07473672 0.01332934 0.02096981 0.02273967 0.02351854 0.02772882
## 3  0.07502132 0.01480270 0.02216703 0.02427265 0.02641129 0.02971372
## 4  0.06649153 0.01403029 0.02035875 0.02174776 0.02371727 0.02642655
##          PC6        PC7        PC8        PC9       PC10
## 2 0.02715851 0.03035254 0.03196843 0.03390007 0.02894221
## 3 0.03042163 0.03275026 0.03448225 0.03638009 0.03376130
## 4 0.02732638 0.02892268 0.03068407 0.03239020 0.03024610
## 
## Residual Deviance: 12011.27 
## AIC: 12077.27
poverty_pca_log_prediction_train <- predict(object = poverty_pca_multinolog, newdata = poverty_pca_train_training) 
## table(actual = poverty_pca_train_training$Target, predict = poverty_pca_log_prediction_train)
## confusionMatrix(poverty_pca_log_prediction_train, poverty_pca_train_training$Target)

Confusion Matrix Interpretation

On the validation dataset, the model has an overall accuracy of 0.6482: out of all predictions across Extreme Poverty, Moderate Poverty, Vulnerable Households, and Non-Vulnerable Households, 64.82% are correct.

For the prediction of Extreme Poverty (Class 1), the model has a sensitivity of 2.2% and a positive predictive value (PPV) of 41.67%. This suggests that:

  1. Out of individuals who are actually extremely poor, we correctly identify only about 2.2%. The model performs poorly on this metric.

  2. Out of all the individuals we predict to be extremely poor, 41.67% are actually extremely poor.

poverty_pca_log_prediction_validation <- predict(object = poverty_pca_multinolog, newdata =poverty_pca_train_validation)
confusionMatrix(poverty_pca_log_prediction_validation, poverty_pca_train_validation$Target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1   14   14    0    5
##          2   98  138   60  111
##          3    0    0    2    1
##          4  114  327  300 1681
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6405          
##                  95% CI : (0.6226, 0.6581)
##     No Information Rate : 0.6276          
##     P-Value [Acc > NIR] : 0.07895         
##                                           
##                   Kappa : 0.1915          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2  Class: 3 Class: 4
## Sensitivity          0.061947  0.28810 0.0055249   0.9349
## Specificity          0.992800  0.88726 0.9996005   0.3055
## Pos Pred Value       0.424242  0.33907 0.6666667   0.6941
## Neg Pred Value       0.925141  0.86127 0.8742138   0.7359
## Prevalence           0.078883  0.16719 0.1263525   0.6276
## Detection Rate       0.004887  0.04817 0.0006981   0.5867
## Detection Prevalence 0.011518  0.14206 0.0010471   0.8454
## Balanced Accuracy    0.527374  0.58768 0.5025627   0.6202
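As a sanity check, the Class 1 sensitivity and positive predictive value reported by `confusionMatrix` can be reproduced by hand from the counts printed above (a quick sketch; the matrix `cm` below simply copies the printed table):

```r
# Reproduce the printed validation confusion matrix
# (rows = predictions, columns = reference, as caret prints it)
cm <- matrix(c(14,  14,   0,    5,
               98, 138,  60,  111,
                0,   0,   2,    1,
              114, 327, 300, 1681),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = 1:4, Reference = 1:4))

# Sensitivity for Class 1: true positives over all actual Class 1 cases (column sum)
sensitivity_1 <- cm[1, 1] / sum(cm[, 1])  # 14 / 226 ~ 0.0619

# Positive predictive value for Class 1: true positives over all Class 1 predictions (row sum)
ppv_1 <- cm[1, 1] / sum(cm[1, ])          # 14 / 33 ~ 0.4242
```

These match the `Sensitivity` and `Pos Pred Value` figures in the `Statistics by Class` output above.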
poverty_pca_log_prediction_test <- predict(object = poverty_pca_multinolog, newdata =poverty_pca_test)


result <- data.frame(poverty_pca_test$Id, poverty_pca_log_prediction_test)
colnames(result) <- c("Id", "Target")
write.csv(result,"submission_shuting.csv",row.names = FALSE)

Logistic Kaggle Result

Here is our Kaggle submission result:

p13 <- ggplot(poverty_pca_train_training, aes(x = as.factor(Target), y = PC1)) +
  geom_boxplot(aes(fill = as.factor(Target))) +
  coord_flip() +
  labs(x = "degree of poverty", fill = "degree_of_poverty") +
  ggtitle("Distribution of principal component 1 by poverty level") +
  theme(plot.title = element_text(hjust = 0.5))
p13

Business Recommendations

Based on our EDA, models, and analyses of the findings, we make the following recommendations in response to the project's business questions.

1) The most important variables for identifying household poverty level, from both the models and the EDA, are those related to schooling, household crowding, house quality, and the number of cellphones owned.

From the models we found that:

Schooling: more schooling generally indicates that a household is more likely to be non-vulnerable. Related variable: meaneduc, average years of education of the adults in the household.

Household crowding: a more crowded household generally indicates greater poverty. Related variables: r4t3, total persons in the household; overcrowding, number of persons per room.

House quality: worse house quality generally indicates greater poverty. Related variable: cielorazo, whether the house has a ceiling.

Number of cellphones owned: owning more cellphones generally indicates that a household is less likely to be impoverished. Related variable: qmobilephone, number of mobile phones owned.

These interpretations from the models closely match the boxplots we drew earlier.

These interpretations may seem intuitive, but they help us pinpoint the most important variables to examine when predicting poverty levels, so they do carry practical value.

2) We recommend the random forest model for the Inter-American Development Bank to identify impoverished individuals who should receive its aid programs.

In terms of overall accuracy, logistic regression achieves 64.82%, random forest 93.72%, and GBM 77.7%.

For Class 1 sensitivity, logistic regression achieves 2.21%, random forest 78.76%, and GBM 48.62%.

For Class 1 PPV, logistic regression achieves 41.67%, random forest 97.8%, and GBM 87.23%.

For kappa, logistic regression achieves 0.1892, random forest 0.883, and GBM 0.5473.

On Kaggle, logistic regression scores 0.29, random forest 0.375, and GBM 0.368.

Overall, we recommend the random forest model because it has the highest accuracy, the highest Class 1 sensitivity, the highest Class 1 PPV, the highest kappa, and the highest Kaggle score.
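The metrics above can be collected into one data frame for side-by-side comparison (a small sketch using only the numbers reported in this section):

```r
# Model comparison using the metrics reported above
model_comparison <- data.frame(
  model        = c("Logistic regression", "Random forest", "GBM"),
  accuracy     = c(0.6482, 0.9372, 0.7770),
  sens_class1  = c(0.0221, 0.7876, 0.4862),
  ppv_class1   = c(0.4167, 0.9780, 0.8723),
  kappa        = c(0.1892, 0.8830, 0.5473),
  kaggle_score = c(0.29,   0.375,  0.368)
)

# Random forest leads on every metric
model_comparison$model[which.max(model_comparison$kappa)]
```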

Vision of Our Project

Our recommendations will help the Inter-American Development Bank identify impoverished households and reduce social inequality by providing sufficient aid to the people who need it most, while ensuring the bank does not waste its limited aid resources on non-vulnerable people who do not need the money. Compared to the PMT model the bank currently uses, our random forest model identifies impoverished households more accurately. Although the project primarily targets households in Costa Rica, the models can be applied to other countries and regions in the future, helping the institution expand its impact on improving inequality.

APPENDIX - How to Choose Models?

This problem is a typical multi-class classification problem, and many features in the dataset are already dummified, so it is natural to consider logistic regression as a baseline predictive model. In addition, GBM, random forest, and SVM perform robustly on classification problems, so we also considered applying them, ultimately selecting GBM and random forest as additional models. After training the models and submitting the results, the random forest model stands out for its higher accuracy.